Distributed Dictionary Learning
The paper studies distributed Dictionary Learning (DL) problems where the
learning task is distributed over a multi-agent network with time-varying
(nonsymmetric) connectivity. This formulation is relevant, for instance, in
big-data scenarios where massive amounts of data are collected/stored in
different spatial locations and it is infeasible to aggregate and/or process
all the data in a fusion center, due to resource limitations, communication
overhead or privacy considerations. We develop a general distributed
algorithmic framework for the (nonconvex) DL problem and establish its
asymptotic convergence. The new method hinges on Successive Convex
Approximation (SCA) techniques, coupled with i) a gradient tracking mechanism
instrumental in locally estimating the missing global information; and ii) a
consensus step, as a mechanism to distribute the computations among the agents.
To the best of our knowledge, this is the first distributed algorithm with
provable convergence for the DL problem and, more generally, for bi-convex
optimization problems over (time-varying) directed graphs.
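The interplay of the consensus step and the gradient-tracking mechanism can be sketched on a toy convex problem. The quadratic local costs, static ring topology, and doubly stochastic weights below are simplifying assumptions for illustration; the paper treats the nonconvex DL objective over time-varying directed graphs.

```python
import numpy as np

# Toy setup: agent i holds a local quadratic cost
# f_i(x) = 0.5 * a_i * x^2 + b_i * x; the network-wide sum is
# minimized at x* = -sum(b) / sum(a). All constants are illustrative.
rng = np.random.default_rng(0)
n_agents = 5
a = rng.uniform(1.0, 2.0, n_agents)
b = rng.uniform(-1.0, 1.0, n_agents)
x_star = -b.sum() / a.sum()

# Doubly stochastic mixing matrix for a static ring graph (an
# assumption for simplicity; the paper allows time-varying digraphs).
W = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    W[i, i] = 0.5
    W[i, (i + 1) % n_agents] = 0.25
    W[i, (i - 1) % n_agents] = 0.25

x = np.zeros(n_agents)          # local iterates, one per agent
grad = a * x + b                # local gradients
y = grad.copy()                 # gradient trackers

step = 0.1
for _ in range(200):
    # Consensus on the iterates, then a descent step along the tracker.
    x_new = W @ x - step * y
    grad_new = a * x_new + b
    # Tracker update: mix neighbors' trackers and add the local
    # gradient innovation, so each y_i tracks the network-average gradient.
    y = W @ y + grad_new - grad
    x, grad = x_new, grad_new

print("max deviation from optimum:", float(np.abs(x - x_star).max()))
```

Despite each agent seeing only its own cost, the tracker supplies the missing global gradient information and all agents agree on the optimum.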
Hybrid Random/Deterministic Parallel Algorithms for Nonconvex Big Data Optimization
We propose a decomposition framework for the parallel optimization of the sum
of a differentiable (possibly nonconvex) function and a nonsmooth (possibly
nonseparable), convex one. The latter term is usually employed to enforce
structure in the solution, typically sparsity. The main contribution of this
work is a novel parallel, hybrid random/deterministic decomposition
scheme wherein, at each iteration, a subset of (block) variables is updated at
the same time by minimizing local convex approximations of the original
nonconvex function. To tackle huge-scale problems, the (block) variables
to be updated are chosen according to a mixed random and deterministic
procedure, which captures the advantages of both pure deterministic and random
update-based schemes. Almost sure convergence of the proposed scheme is
established. Numerical results show that on huge-scale problems the proposed
hybrid random/deterministic algorithm outperforms both random and deterministic
schemes.
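The mixed selection rule can be illustrated on a simple sparse least-squares (lasso) instance. The greedy score, block sizes, step size, and all constants below are assumptions for illustration, not the paper's exact scheme.

```python
import numpy as np

# Illustrative problem: min 0.5*||Ax - b||^2 + lam*||x||_1, solved by
# parallel proximal updates on a subset of coordinates per iteration.
rng = np.random.default_rng(1)
m, n = 40, 100
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[:5] = rng.standard_normal(5)   # sparse ground truth
b = A @ x_true
lam = 0.1
L = np.linalg.norm(A, 2) ** 2         # Lipschitz constant of the gradient

def soft(v, t):
    # Soft-thresholding: proximal operator of the l1 norm.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(n)
for _ in range(300):
    g = A.T @ (A @ x - b)
    # Deterministic part: coordinates with the largest proximal residual.
    resid = np.abs(x - soft(x - g / L, lam / L))
    det = np.argsort(resid)[-5:]
    # Random part: a few uniformly sampled coordinates.
    rnd = rng.choice(n, size=5, replace=False)
    S = np.union1d(det, rnd)
    # Parallel proximal-gradient update of the selected coordinates only.
    x[S] = soft(x[S] - g[S] / L, lam / L)

obj = 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum()
print("final objective:", round(float(obj), 3))
```

The greedy half targets the coordinates that currently matter most, while the random half guarantees every coordinate is eventually visited.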
On the impact of activation and normalization in obtaining isometric embeddings at initialization
In this paper, we explore the structure of the penultimate Gram matrix in
deep neural networks, which contains the pairwise inner products of outputs
corresponding to a batch of inputs. In several architectures it has been
observed that this Gram matrix becomes degenerate with depth at initialization,
which dramatically slows training. Normalization layers, such as batch or layer
normalization, play a pivotal role in preventing the rank collapse issue.
Despite promising advances, the existing theoretical results (i) do not extend
to layer normalization, which is widely used in transformers, and (ii) cannot
characterize the bias of normalization quantitatively at finite depth.
To bridge this gap, we provide a proof that layer normalization, in
conjunction with activation layers, biases the Gram matrix of a multilayer
perceptron towards isometry at an exponential rate with depth at
initialization. We quantify this rate using the Hermite expansion of the
activation function, highlighting the importance of higher-order
Hermite coefficients in the bias towards isometry.
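As a hedged illustration of the quantity driving the rate, the normalized (probabilists') Hermite coefficients of an activation can be computed numerically. The choice of ReLU and the quadrature order are illustrative assumptions; the paper's rate depends on the coefficients of whichever activation is used.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval
from math import factorial, sqrt, pi

# Gauss-Hermite quadrature nodes/weights for the weight exp(-x^2/2).
nodes, weights = hermegauss(80)

def hermite_coeff(f, k):
    # c_k = E[f(Z) * He_k(Z)] / sqrt(k!) for Z ~ N(0, 1), i.e. the
    # coefficient in the orthonormal Hermite basis.
    e_k = np.zeros(k + 1)
    e_k[k] = 1.0
    vals = f(nodes) * hermeval(nodes, e_k)
    return float(weights @ vals) / sqrt(2 * pi) / sqrt(factorial(k))

relu = lambda x: np.maximum(x, 0.0)
coeffs = [hermite_coeff(relu, k) for k in range(5)]
print([round(c, 4) for c in coeffs])
```

For ReLU the first few coefficients are known in closed form (c0 = 1/sqrt(2*pi), c1 = 1/2, c2 = 1/sqrt(4*pi), c3 = 0), which makes the quadrature easy to sanity-check.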
Batch Normalization Orthogonalizes Representations in Deep Random Networks
This paper underlines a subtle property of batch-normalization (BN):
Successive batch normalizations with random linear transformations make hidden
representations increasingly orthogonal across layers of a deep neural network.
We establish a non-asymptotic characterization of the interplay between depth,
width, and the orthogonality of deep representations. More precisely, under a
mild assumption, we prove that the deviation of the representations from
orthogonality rapidly decays with depth up to a term inversely proportional to
the network width. This result has two main implications: 1) Theoretically, as
the depth grows, the distribution of the representation -- after the linear
layers -- contracts to a Wasserstein-2 ball around an isotropic Gaussian
distribution. Furthermore, the radius of this Wasserstein ball shrinks with the
width of the network. 2) In practice, the orthogonality of the representations
directly influences the performance of stochastic gradient descent (SGD). When
representations are initially aligned, we observe that SGD wastes many
iterations orthogonalizing representations before classification. Nevertheless, we
experimentally show that starting optimization from orthogonal representations
is sufficient to accelerate SGD, with no need for BN.
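The phenomenon can be reproduced in a small simulation. The mean-free BN variant (row-wise RMS normalization with no learned scale/shift), the width, batch size, and depth below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
width, batch, depth = 512, 4, 50

def batch_norm(H):
    # RMS-normalize each feature (row) across the batch: a mean-free BN
    # at initialization, assumed here for simplicity.
    rms = np.sqrt((H ** 2).mean(axis=1, keepdims=True))
    return H / (rms + 1e-8)

def mean_offdiag_cosine(H):
    # Average |cosine similarity| between distinct samples (columns);
    # 0 means perfectly orthogonal representations.
    Hn = H / np.linalg.norm(H, axis=0, keepdims=True)
    G = Hn.T @ Hn
    return float(np.abs(G[~np.eye(batch, dtype=bool)]).mean())

# Nearly aligned inputs: one shared direction plus tiny perturbations.
H = rng.standard_normal((width, 1)) + 0.01 * rng.standard_normal((width, batch))
before = mean_offdiag_cosine(H)

for _ in range(depth):
    W = rng.standard_normal((width, width)) / np.sqrt(width)
    H = batch_norm(W @ H)

after = mean_offdiag_cosine(H)
print(before, "->", after)
```

Starting from almost perfectly aligned samples, the stacked random-linear-plus-BN layers drive the pairwise cosines down, in line with the depth-wise decay the paper characterizes (up to a width-dependent floor).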
Decentralized Dictionary Learning Over Time-Varying Digraphs
This paper studies Dictionary Learning problems wherein the learning task is
distributed over a multi-agent network, modeled as a time-varying directed
graph. This formulation is relevant, for instance, in Big Data scenarios where
massive amounts of data are collected/stored in different locations (e.g.,
sensors, clouds) and aggregating and/or processing all data in a fusion center
might be inefficient or infeasible, due to resource limitations, communication
overheads or privacy issues. We develop a unified decentralized algorithmic
framework for this class of nonconvex problems, which is proved to converge to
stationary solutions at a sublinear rate. The new method hinges on Successive
Convex Approximation techniques, coupled with a decentralized tracking
mechanism aiming at locally estimating the gradient of the smooth part of the
sum-utility. To the best of our knowledge, this is the first provably
convergent decentralized algorithm for Dictionary Learning and, more generally,
bi-convex problems over (time-varying) (di)graphs.
Residual Energy Based Cluster-head Selection in WSNs for IoT Application
A wireless sensor network (WSN) groups specialized transducers that provide
sensing services to Internet of Things (IoT) devices with limited energy and
storage resources. Since replacement or recharging of batteries in sensor nodes
is almost impossible, power consumption becomes one of the crucial design
issues in WSNs. Clustering algorithms play an important role in power
conservation in such energy-constrained networks. Choosing cluster heads
appropriately can balance the load in the network, thereby reducing energy
consumption and enhancing network lifetime. The paper focuses on an efficient
cluster-head election scheme that rotates the cluster-head position among the
nodes with higher energy levels than the others. The algorithm considers
initial energy, residual energy, and an optimal number of cluster heads to
elect the next group of cluster heads for the network, which suits IoT
applications such as environmental monitoring, smart cities, and similar
systems. Simulation analysis shows that the modified version outperforms the
LEACH protocol, enhancing throughput by 60%, lifetime by 66%, and residual
energy by 64%.
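A LEACH-style election threshold modified by residual energy can be sketched as follows. The weighting by the residual-to-initial energy ratio and all constants are illustrative assumptions, not the paper's exact formula.

```python
import random

P = 0.1  # desired fraction of cluster heads per round (assumed)

def threshold(p, round_no, residual, initial):
    # LEACH-style base threshold, scaled by the residual-to-initial
    # energy ratio so that high-energy nodes are favored (an assumption).
    base = p / (1 - p * (round_no % int(1 / p)))
    return base * (residual / initial)

def elect_heads(nodes, round_no, p=P, rng=random.Random(3)):
    heads = []
    for node in nodes:
        t = threshold(p, round_no, node["residual"], node["initial"])
        if rng.random() < t:
            heads.append(node["id"])
    return heads

# Synthetic network: nodes 0-49 have high residual energy, 50-99 low.
nodes = [{"id": i, "initial": 2.0,
          "residual": 2.0 if i < 50 else 0.2} for i in range(100)]
counts = {i: 0 for i in range(100)}
for r in range(200):
    for h in elect_heads(nodes, r):
        counts[h] += 1

high = sum(counts[i] for i in range(50))
low = sum(counts[i] for i in range(50, 100))
print("elections: high-energy =", high, ", low-energy =", low)
```

Over many rounds, nodes with more residual energy are elected far more often, which is the load-balancing behavior the abstract describes.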
Communication Technologies for edge learning and inference: a novel framework, open issues, and perspectives
With the continuous advancement of smart devices and their demand for data, complex computation that was previously exclusive to the cloud server is now moving towards the edge of the network. For numerous reasons (e.g., applications demanding low latency and data privacy), data-based computation has been brought closer to its originating source, forging the Edge Computing paradigm. Together with Machine Learning, Edge Computing has turned into a powerful local decision-making tool, fostering the advent of Edge Learning. The latter, however, is delay-sensitive as well as resource-hungry in terms of hardware and networking. New methods have been developed to solve, or at least mitigate, these issues, including those proposed in this research. In this study, we first investigate representative communication methods for edge learning and inference (ELI), focusing on data compression, latency, and resource management. Next, we propose an ELI-based video data prioritization framework that considers only frames containing events, thereby significantly reducing transmission and storage requirements when deployed in surveillance networks. Furthermore, in this overview, we critically examine various communication aspects of Edge Learning, analyzing their issues and highlighting their advantages and disadvantages. Finally, we discuss challenges and present open issues that are yet to be overcome.
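The event-based prioritization idea can be sketched with simple frame differencing: a frame is marked for transmission only when its mean pixel change against the last kept frame exceeds a threshold. The threshold value and the synthetic frames are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def prioritize(frames, threshold=0.1):
    # Keep a frame only if it differs enough from the last kept frame;
    # everything else is dropped before transmission/storage.
    kept, prev = [], None
    for idx, frame in enumerate(frames):
        if prev is None or np.abs(frame - prev).mean() > threshold:
            kept.append(idx)      # "event" frame worth transmitting
            prev = frame
    return kept

# Synthetic surveillance clip: static scene for frames 0-3, then a
# persistent change (the "event") from frame 4 onward.
frames = [np.zeros((8, 8)) for _ in range(4)]
frames += [np.ones((8, 8)) for _ in range(6)]
kept = prioritize(frames)
print(kept)
```

Only the first frame and the frame where the scene changes survive, so a mostly static surveillance feed shrinks to a handful of event frames.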